Web Directories as Training Data for Automated Metadata Extraction

Authors

  • Martin Kavalec
  • Petr Strossa
Abstract

Although man-made annotations are considered the main ‘knowledge fuel’ for the Semantic Web, the majority of existing commercial pages still carry little or no metadata, let alone conform to forthcoming standards such as the RDF syntax or the Dublin Core semantics. Information Extraction, which relies on characteristic patterns in text, can be applied even to such ‘legacy’ pages to obtain metadata containing, for example, the names, types, and domains of activity of WWW subjects (companies).

Similar resources

On the Automated Classification of Web Sites

In this paper we discuss several issues related to automated text classification of web sites. We analyze the nature of web content and metadata in relation to requirements for text features. We find that HTML metatags are a good source of text features, but are not in wide use despite their role in search engine rankings. We present an approach for targeted spidering including metadata extract...

Practical Issues for Automated Categorization of Web Sites

In this paper we discuss several issues related to automated text classification of web sites. We analyze the nature of web content and metadata and requirements for text features. We present an approach for targeted spidering including metadata extraction and opportunistic crawling of specific semantic hyperlinks. We describe a system for automatically classifying web sites into industry categ...

Ontea: Platform for Pattern Based Automated Semantic Annotation

Automated annotation of web documents is a key challenge of the Semantic Web effort. Semantic metadata can be created manually or using automated annotation or tagging tools. Automated semantic annotation tools with the best results are built on various machine learning algorithms, which require training sets. Another approach is to use pattern-based semantic annotation solutions built on natural lang...

Towards Large Scale Semantic Annotation Built on MapReduce Architecture

Automated annotation of web documents is a key challenge of the Semantic Web effort. Web documents are structured, but their structure is understandable only to a human, which is the major problem of the Semantic Web. The Semantic Web can be exploited only if computer-understandable metadata reaches critical mass. Semantic metadata can be created manually, using automated annotation or tagging too...

Automatic Generation of RDF Metadata final version

The Resource Description Framework (RDF[9]) has been developed to fulfil the need for a mechanism for resource description within the Web's architecture. With over 320 million[10] individually accessible objects on the Web, the ability to describe each one so that it can be conceptualized without being accessed and analyzed is increasingly important. This paper describes how an automatic classi...

Publication date: 2001